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PREDICTIVE LEARNING VIA RULE ENSEMBLES 

By Jerome H. Friedman ^ and Bogdan E. Popescu^ 

Stanford University 

General regression and classification models are constructed as 
linear combinations of simple rules derived from the data. Each rule 
consists of a conjunction of a small number of simple statements 
concerning the values of individual input variables. These rule en- 
sembles are shown to produce predictive accuracy comparable to the 
best methods. However, their principal advantage lies in interpreta- 
tion. Because of its simple form, each rule is easy to understand, as 
is its influence on individual predictions, selected subsets of predic- 
tions, or globally over the entire space of joint input variable values. 
Similarly, the degree of relevance of the respective input variables can 
be assessed globally, locally in different regions of the input space, or 
at individual prediction points. Techniques are presented for auto- 
matically identifying those variables that are involved in interactions 
with other variables, the strength and degree of those interactions, 
as well as the identities of the other variables with which they inter- 
act. Graphical representations are used to visualize both main and 
interaction effects. 

1. Introduction. Predictive learning is a common application in data 
mining, machine learning and pattern recognition. The purpose is to predict 
the unknown value of an attribute y of a system under study, using the 
known joint values of other attributes x = {xi,X2, ■ ■ ■ ,x„) associated with 
that system. The prediction takes the form y = -F(x), where the function 
F(x.) maps a set of joint values of the "input" variables x to a value y for 
the "output" variable y. The goal is to produce an accurate mapping. Lack 
of accuracy is defined by the prediction "risk" 

(1) R{F) = E^yL{y,F{x)), 
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where L{y,y) represents a loss or cost for predicting a value y when the 
actual value is y, and the expected value is over the joint distribution of 
all variables (x, y) for the data to be predicted. Using this definition, the 
optimal mapping ("target") function is given by 

(2) F*(x) = argmmE^yL{y,F{^)). 

F{x) 

With predictive learning, one is given a "training" sample of previously 
solved cases {xj,y,}]^ where the joint values of all variables have been de- 
termined. An approximation -F(x) to F*{x.) is derived by applying a learning 
procedure to these data. 

2. Ensemble learning. Learning ensembles have emerged as being among 
the most powerful learning methods [see Breiman (1996, 2001), Freund and 
Schapire (1996), Friedman (2001)]. Their structural model takes the form 

M 

(3) F(x) =ao+Y^ amfmi^), 

m=l 

where M is the size of the ensemble and each ensemble member ("base 
learner" ) fm (x) is a different function of the input variables x derived from 
the training data. Ensemble predictions i^(x) are taken to be a linear com- 
bination of the predictions of each of the ensemble members, with {flmlo^ 
being the corresponding parameters specifying the particular linear combi- 
nation. Ensemble methods differ in choice of particular base learners (func- 
tion class), how they are derived from the data, and the prescription for 
obtaining the linear combination parameters {om}o^- 

The approach taken here is based on the importance sampled learning 
ensemble (ISLE) methodology described in Friedman and Popescu (2003). 
Given a set of base learners {/^(x)}*^, the parameters of the linear combi- 
nation are obtained by a regularized linear regression on the training data 
{xj, yili , 

N / M \ M 

(4) {am}o^ = arg miri^ ^ L h/j, cq + ^ am/m(xi) + A • ^ la^l. 

The first term in (4) measures the prediction risk (1) on the training sam- 
ple, and the second (regularization) term penalizes large values for the co- 
efficients of the base learners. The influence of this penalty is regulated by 
the value of A > 0. It is well known that for this ("lasso") penalty, larger 
values of A produce more overall shrinkage as well as increased dispersion 
among the values {|am|}i^5 often with many being set to zero [see Tibshirani 
(1996), Donoho et al. (1995)]. Its value is taken to be that which minimizes 
an estimate of future prediction risk (1) based on a separate sample not 
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used in training, or by full (multi-fold) cross-validation. Fast algorithms for 
solving (4) for all values of A > 0, using a variety of loss functions L{y,y), 
are presented in Friedman and Popescu (2004). 

The base learners {/^(x)}*'^ used in (3) and (4) to characterize the en- 
semble are randomly generated using the perturbation sampling technique 
described in Friedman and Popescu (2003). Each one is taken to be a sim- 
ple function of the predictor variables characterized by a set of parameters 
P = ipi,P2,---)- That is, 

(5) /„^(x) =/(x;p^), 

where Pm represents a specific set of joint parameter values indexing a spe- 
cific function fm{^) from the parameterized class /(x;p). Particular choices 
for such parameterized function classes are discussed below. 

Given a function class, the individual members of the ensemble are gener- 
ated using the prescription presented in Friedman and Popescu (2003) and 
shown in Algorithm 1. 

Algorithm 1 (Ensemble generation). 

1 Fo{^) =argmmcJ2iLiL{yi,c) 

2 For m = 1 to M { 

3 Pm =argminpX;je5„{»?)-^(y'M-^m-i(xi) +/(xi;p)) 

4 /m(x) =/(x;pm) 

5 Fm(x) = Fm-i(x) + 1/ • /„^(x) 

7 ensemble = { /m (x) } *^ 

In line 3, Smiv) represents a different subsample of size t] < N ran- 
domly drawn without replacement from the original training data, Smiv) C 
{:Ki,yi}i . As discussed in Friedman and Popescu (2003), smaller values of t] 
encourage increased dispersion (less correlation) among the ensemble mem- 
bers {/m(x)}]^ by training them on more diverse subsamples. Smaller values 
also reduce computation by a factor of N/ij. 

At each step m, the "memory" function 

m— 1 

Fm_i(x) = Fo(x) + i/- ^ /fc(x) 

k=l 

contains partial information concerning the previously induced ensemble 
members {/fc(x)}™~^ as controlled by the value of the "shrinkage" parameter 
< < 1. At one extreme, setting 1^ = causes each base learner fm{^) to be 
generated without reference to those previously induced, whereas the other 
extreme v = 1 maximizes their influence. Intermediate values < < 1 vary 
the degree to which previously chosen base learners effect the generation of 
each successive one in the sequence. 
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Several popular ensemble methods represent special cases of Algorithm 1. 
A "bagged" ensemble [Breiman (1996)] is obtained by using squared-error 
loss, L{y,y) = {y — y)^, and setting v = 0, and r/ = N/2 or, equivalently, 
choosing Sm (line 3) to be a bootstrap sample [Friedman and Hall (2007)]. 
Random forests [Breiman (2001)] introduce increased ensemble dispersion by 
additionally randomizing the algorithm ("arg min," line 3) used to solve for 
the ensemble members (large decision trees). In both cases the coefficients in 
(3) are set to ao = y, {am = l/M}*^ so that predictions are a simple average 
of those of the ensemble members. 

AdaBoost [Freund and Schapire (1996)] uses exponential loss, L{y,y) = 
exp(—y ■ y) for y S { — 1, 1}, and is equivalent to setting u = \ and rj = N in 
Algorithm 1 . Predictions are taken to be the sign of the final memory func- 
tion Fm(x). mart [Friedman (2001)] uses a variety of loss criteria L{y,y) 
for arbitrary y, and in default mode sets v = 0.1 and r] = N/2. Predictions 
are given by Fm(x). 

Friedman and Popescu (2003) experimented with a variety of joint (z/, rj) 
values for generating ensembles of small decision trees, followed by (4) to es- 
timate the linear combination parameters. Their empirical results indicated 
that small but nonzero values of i/ (z^ ~ 0.01) performed best in this context. 
Results were seen to be fairly insensitive to the value chosen for r], provided 
it was small (77 < N/2) and grew less rapidly than the total sample size 
(?7 ~ \/iV) as becomes large {N > 500). 

Although, in principle, most of these procedures can be used with other 
base learners, they have almost exclusively been applied with decision trees 
[Breiman et al. (1983), Quinlan (1993)]. 

3. Rule based ensembles. The base learners considered here are simple 
rules. Let Sj be the set of all possible values for input variable xj, xj £ Sj, 
and Sjm be a specified subset of those values, Sjm ^ Sj. Then each base 
learner takes the form of a conjunctive rule 

n 

(6) r.m(x) = Y[ £ Sjm), 

where /(•) is an indicator of the truth of its argument. Each such base 
learner assumes two values ^^(x) G {0,1}. It is nonzero when all of the 
input variables realize values that are simultaneously within their respective 
subsets {xj £ Sjm}i - For variables that assume orderable values, the subsets 
are taken to be contiguous intervals 

defined by a lower and upper limit, tjm < Xj < Ujm- For categorical variables 
assuming unorderable values (names), the subsets are explicitly enumerated. 
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Such rules (6) can be regarded as parameterized functions of x (5), where the 
parameters pm are the quantities that define the respective subsets {sjm}- 
Note that for the case in which the subset of values Sjm (real or cate- 
gorical) appearing in a factor of (6) is in fact the entire set Sjm = Sj, the 
corresponding factor can be omitted from the product. In this case the rule 

(6) can be expressed in the simpler form 

(7) rm(x) = Y[ ^(.^j ^ ^jm)- 

The particular input variables Xj for which Sjm 7^ Sj are said to be those 
that "define" the rule rm(x). For purposes of interpretation, it is desirable 
that the ensemble be comprised of "simple" rules each defined by a small 
number of variables. As an example, the rule 

r 1(18 < age < 34) 
rm(x) = < -/(marital status G {single, living together-not married}) 
[ -/(householder status = rent) 

is defined by three variables, and a nonzero value increases the odds of 
frequenting bars and night clubs. 

3.1. Rule generation. One way to attempt to generate a rule ensemble 
is to let the base learner /(x;p) appearing in Algorithm 1 take the form 
of a rule (6) and then try to solve the optimization problem on line 3 for 
the respective variable subsets {sjm}- Such a (combinatorial) optimization 
is generally infeasible for more that a few predictor variables, although fast 
approximate algorithms can be employed [Cohen and Singer (1999), Weiss 
and Indurkhya (2000)]. The approach used here is to view a decision tree as 
defining a collection of rules and take advantage of existing fast algorithms 
for producing decision tree ensembles. That is, decision trees are used as the 
base learner /(x;p) in Algorithm 1. Each node (interior and terminal) of 
each resulting tree /m(x) produces a rule of the form (7). 

This is illustrated in Figure 1 which shows a typical decision tree with 
five terminal nodes that could result from using a decision tree algorithm in 
conjunction with Algorithm 1. Associated with each interior node is one of 
the input variables Xj. For variables that realize orderable values, a partic- 
ular value of that variable ("split point") is also associated with the node. 
For variables that assume unorderable categorical values, a specified subset 
of those values replaces the split point. For the tree displayed in Figure 1, 
nodes and 4 are associated with orderable variable X14 with split points 
u and t respectively, node 1 is associated with categorical variable X32 with 
subset values {a,6, c}, and node 2 is associated with categorical variable x^ 
with the single value {z}. 
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Fig. 1. A decision tree. The rule corresponding to each node is given by the product of 
the indicator functions associated with all of the edges on the path from the root to that 
node. 

Each edge of the tree connecting a "parent" node to one of its two "daugh- 
ter" nodes represents a factor in (7) contributing to the rules corresponding 
to ah descendent nodes of the parent. These factors are shown in Figure 1 
for each such edge. The rule corresponding to any node in the tree is given 
by the product of the factors associated with all of the edges on the path 
from the root to that node. Note that there is no rule corresponding to the 
root node. As examples, in Figure 1 the rules corresponding to nodes 1, 4, 
6, and 7 are respectively: 

ri(x) = /(xi4 < u), 

r4(x) = /(Xi4 < u) ■ I{X32 i. {a, 6, c}), 
r6(x) = l(t < xi4 < u) ■ I{x32 i- {a,6,c}), 
r7(x) = /(xi4 > u) • l(xj = z). 

3.2. Rule fitting. The collection of all such rules derived from all of 
the trees {fm{'^)}i^ produced by Algorithm 1 constitute the rule ensem- 
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ble {rfc(x)}{^. The total number of rules is 

M 

(8) K=Y,2{tm-l), 

m=l 

where tm is the number of terminal nodes for the mth tree. The predictive 
model is 

K 

(9) F(x) = ao + ^ afcrfc(x), 

k=l 

with 

TV / K \ K 

(10) {ak}o = arg min^ ^ L h/j , aq + X! Ofcrfc(xi) j + A • ^ \ak\. 

Fast algorithms for solving (10) for all values of A > 0, and procedures for 
choosing a value for A, are discussed in Friedman and Popescu (2004). 

The solution to (10) for A > is not equivariant under different scaling 
transformations applied to each of the predicting rules rfc(x). Increasing the 
scale of a rule by rfc(x) <— bk ■ ^^(x) {hk > 1) and decreasing its corresponding 
coefficient Uk ^ a-k/bk produces the same loss in the first term of (10), but 
reduces its contribution to the second penalty term. Therefore, the coeffi- 
cients of rules with larger scales are penalized less than those with smaller 
scales. The scale of a rule is characterized by its standard deviation 

(11) tk = \lsk{l-Sk), 
where Sk is its support on the training data 

1 ^ 

(12) Sk = —^rk{^i). 

i=l 

A common practice is to give all predictors equal a priori influence, for 
example, by replacing each rule by a normalized version rfc(x) ^ rfc(x)/tfc in 
(10). The strategy applied here is to use the original unnormalized rules in 
(10). This places increased penalty on coefficients of rules with very small 
support Sk — and on those with very large support ~ 1. The overall 
effect is to reduce the variance of the estimated model (9) since rules with 
such small support, or the complement of those with such large support, are 
each defined by a correspondingly small number of training observations. 

3.3. Tree size. As seen in Figure 1 the size of each tree, as characterized 
by the number of its terminal nodes, along with the tree topology, deter- 
mines the maximum number of factors appearing in the rules (7) derived 
from that tree. The topology of each individual tree is determined by the 
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data. However, larger trees generally allow more complex rules to be pro- 
duced in terms of the number of variables (factors) that define them. For 
example, the smallest trees with only two terminal nodes ("stumps") gener- 
ate rules limited to one factor in (7), whereas an L terminal node tree can, 
in principle, generate rules involving up to L — 1 factors. Thus, controlling 
tree size directly controls maximum complexity, and indirectly the average 
complexity, of the rules that comprise the ensemble. 

Controlling tree size, and thereby average rule complexity, also influences 
the type of target functions (2) that are most easily approximated by the 
ensemble. In order to capture interaction effects involving I variables, the 
ensemble must include rules with / or more factors. Thus, targets that involve 
strong high order interaction effects require larger trees than those that are 
dominately influenced by main effects and/or low order interactions. On the 
other hand, for a given size K (8), ensembles comprised of a large fraction of 
high order interaction rules will necessarily involve fewer of lower order that 
are best able to capture main and low order interaction effects. Therefore, 
larger trees can be counter productive for targets of this latter type. The best 
tree size is thus governed by the nature of the (unknown) target function. 

The strategy used here is to produce an ensemble of trees of varying sizes 
from which to extract the rules by letting the number of terminal nodes tm 
of each tree be a random variable 

t^ = 2 + //(7). 

Here 7 is randomly drawn from an exponential distribution with probability 

(13) Pr(7) = exp(-7/(L-2))/(Z-2), 

and //(7) is the largest integer less than or equal to 7. The quantity L > 2 
represents the average number of terminal nodes for trees in the ensemble. 
For L = 2, the entire ensemble will be composed of rules each involving only 
one of the input variables and thereby capture main effects only. Larger val- 
ues produce trees of varying size tm., mostly with tm < L, but many with 
tm > L and some with tm^ L producing some rules capable of captur- 
ing high order interactions, if present. The fitting procedure (10) can then 
attempt to select those rules most relevant for prediction. The use of an 
exponential distribution (13) counters the tendency of trees (of a given size) 
to produce more rules involving a larger number of factors owing to their 
hierarchical (binary tree) topology. The overall result is a more evenly dis- 
tributed ensemble in terms of the complexity of its rules. 

The average tree size L is a "meta" -parameter of the procedure that con- 
trols the distribution of the complexity of the rules {rfc(x)}{^ comprising the 
ensemble. A choice for its value can be based on prior suspicions concerning 
the nature of the target F* (x) , or one can experiment with several values 
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using an estimate of future predictive accuracy based on an independent 
sample or cross-validation. Also, examination of the actual rules chosen for 
prediction in (10) can suggest potential modifications. 

3.4. Loss functions. Any predictive learning method involves the spec- 
ification of a loss function L(y,F) that characterizes the loss or cost of 
predicting an outcome or response value F when the actual value is y. As 
described in Friedman and Popescu (2003, 2004), the ensemble procedures 
presented here can be implemented with a variety of different loss criteria. 
Specific choices can have a substantial effect on predictive models estimated 
from data, and are appropriate in different settings. For example, if the 
deviations from the target F*{x) (2) follow a (homoskedastic) Gaussian dis- 
tribution 

(14) y,-iV(F*(xi),(T2), 
then squared-error loss 

(15) L{y,F) = {y-F)^ 

is most appropriate. 

For other distributions of a numeric outcome variable y, and especially in 
the presence of outliers, the Huber (1964) loss 

(16) Lf^y \y-F\<S, 

^'^^ ^(^'^)-U(|y-F|-V2), \y-F\>6, 

provides increased robustness, while sacrificing very little accuracy in sit- 
uations characterized by (14) [see Friedman and Popescu (2004)]. It is a 
compromise between squared-error loss (15) and absolute deviation loss 
L{y,F) = \y — F\. The value of the "transition" point 6 differentiates the 
residuals that are treated as outliers being subject to absolute loss, from the 
other residuals subject to squared-error loss. Its value is taken to be the ath 
quantile of the data absolute residuals {\yi — F(xj)|}^, where the value of 
a controls the degree of robustness (break down) of the procedure; smaller 
values produce more robustness. For the simulated regression problems illus- 
trated in the following squared-error loss, (15) is used, whereas for the real 
data example Huber loss, (16) with a = 0.9 was employed to guard against 
potential outliers. 

For binary classification y £ {—1, 1}, a variety of loss criteria have been 
proposed [see Hastie, Tibshirani and Friedman (2001)]. Here we use the 
squared-error ramp loss 

(17) L{y,F) = [y- min(-l, max(l, F))]^ 

introduced and studied in Friedman and Popescu (2001, 2004). It was shown 
to produce comparable performance to other commonly used loss criteria, 
but with increased robustness against mislabeled training cases. 
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4. Accuracy. An important property of any learning method is accuracy 
as characterized by its prediction risk (1). As noted in Section 2, decision 
tree ensembles are among the most competitive methods. Friedman and 
Popescu (2001) compared the performance of several decision tree ensemble 
methods in a simulation setting. These included bagging, random forests, 
boosting, and a variety of ISLEs using Algorithm 1 to construct the tree 
ensembles with various joint values for tj and followed by (4) to estimate 
the linear combination parameters. Here we compare the performance of 
rule based ensembles discussed in Section 3 to best performing tree based 
ensembles studied there. 

The simulation consisted of 100 data sets, each with = 10000 obser- 
vations and n = 40 input variables. Each data set was generated from the 
model 

(18) {y, = F*(xi)+ejf, 

with -F*(x) being a different target function for each data set. These 100 
target functions were themselves each randomly generated so as to produce 
a wide variety of different targets in terms of their dependence on the input 
variables x. Details concerning this random function generator are presented 
in Friedman (2001) and also in Friedman and Popescu (2003). The input 
variables were randomly generated according to a standard Gaussian dis- 
tribution Xj ~ A^(0, 1). The irreducible error e was also randomly generated 
from a Gaussian, e ~ A^(0, ci^), with = Varx F*{-x.) to produce a one-to-one 
signal-to-noise ratio. In addition to regression, data for binary classification 
was produced by thresholding the response values for each data set at their 
respective medians 

(19) {iji = sign(yi - median({yfc}f ))}f . 

The resulting optimal decision boundaries for each data set are quite differ- 
ent and fairly complex. 

Here we present a comparison of four methods. The first "MART" [Fried- 
man (2001)] is a popular tree boosting method. The second "ISLE" is the 
best performing tree ensemble considered in Friedman and Popescu (2003) 
as averaged over these 100 data sets. It uses Algorithm 1 to generate the trees 
with rj = N/5 and = 0.01, followed by (4) to estimate the linear combina- 
tion parameters. In both cases the ensembles consisted of 500 six-terminal 
node trees. The third method "RuleFit" here uses exactly the same tree 
ensemble produced by ISLE to facilitate comparison, but then extracts the 
ten rules associated with each of the trees as described in Section 3.1. The 
resulting collection of K = 5000 rules (8) is then used to form the predic- 
tive model (9), (10). The last method RuleFit 200 uses the same procedure 
except that only the first 200 trees are used to extract K = 2000 rules for 
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Fig. 2. Inaccuracy comparisons between tree ensemble methods (Mart, ISLE) and rule 
based ensembles (RuleFit, RuleFit 200). 



the final model. Although a large number of rules are used to fit the model 
in (10), typically only a small fraction (~ 10%) have nonzero solution coef- 
ficient values and are thus required for prediction in (9). 
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The upper left panel of Figure 2 shows the distributions (box plots) of 
the scaled absolute error 

(20) r - E^i\Ft{^)-Fjim , .... 

~ E^[\F*{^) - medianF;(x)|] ' ' " 

over the 100 regression data sets for each of the four methods. Here -Ff (x) 
is the true target function for the Ith. data set, and Fji{x) is the correspond- 
ing estimate for the jth method (j = 1,4). One sees that these 100 target 
functions represent a wide range of difficulty for all methods and that on 
average RuleFit provides slightly better performance. Using rules based on 
only 200 trees is still competitive with the 500 tree MART ensemble, but 
somewhat inferior to the 500 tree ISLE on these typically fairly complex 
target functions. 

The upper right panel of Figure 2 shows the corresponding distributions 
of the comparative absolute error defined by 

(21) Cji = eji/mm{eki}t=i. 

This quantity facilitates individual comparisons by using the error of the 
best method for each data set to calibrate the difficulty of each respective 
problem. The best method j* = argminj{ej7}|^^ for each data set receives 
the value Cj*i = 1, and the others larger values in proportion to their average 
error (20) on that data set. Here one sees that RuleFit based on 500 trees 
yields the best performance, or close to it, on nearly all of the 100 data sets. 
There are a few (~ 5) for which one of the other methods was distinctly 
better. Of course, there are many for which the converse holds. 

The lower panels of Figure 2 show the corresponding results for the clas- 
sification (19). Here lack of performance is measured in terms of error rate 

(22) e,i = ^x/[y/sign(F,i(x))]. 

Again, these 100 classification problems present varying degrees of difficulty 
for all methods with error rates ranging by roughly a factor of three. Both 
rule based methods exhibit slightly superior average classification perfor- 
mance to the tree based ensembles. This is especially refiected in the corre- 
sponding comparative error rates (21), (22) shown in the lower right panel 
where RuleFit based on 500 trees was the best on almost every data set, 
and even RuleFit 200 was substantially better than either of the tree based 
ensembles with 500 trees. 

The results presented in Figure 2 suggest that the rule based approach to 
ensemble learning described in Section 3 produces accuracy comparable to 
that based on decision trees. Other tree based ensemble methods including 
bagging and random forests were compared to those presented here (MART, 
ISLE) in Friedman and Popescu (2003), and seen to exhibit somewhat lower 
accuracy over these 100 regression and classification data sets. Thus, rule 
based ensembles appear to be competitive in accuracy with the best tree 
based ensembles. 
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5. Linear basis functions. With ensemble learning there is no require- 
ment that the basis functions {/m(x)}]^ in (3) and (4) must be generated 
from the same parametric base learner (5). Other basis functions can be in- 
cluded, either generated from another parametric family using Algorithm 1, 
or by some other means. For increased accuracy, the different families should 
be chosen to complement each other in that each is capable of closely ap- 
proximating target functions (2) for which the others have difficulty. For 
the purpose of interpretation, each such family should also produce easily 
understandable basis functions. 

Among the most difficult functions for rule (and tree) based ensembles to 
approximate are linear functions 



(23) F*{^) = bo + Y,bjXj, 



for which a substantial number of the coefficients bj have relatively large 
absolute values. Such targets can require a large number of rules for accurate 
approximation. Especially if the training sample is not large and/or the 
signal-to-noise ratio is small, it may not be possible to reliably estimate 
models with large enough rule sets. Also, models with many roughly equally 
contributing rules are more difficult to interpret. 

These considerations suggest that both accuracy and interpretability might 
be improved by including the original variables {xj}" as additional basis 
functions in (9) and (10) to complement the rule ensemble. In the interest of 
robustness against input variable outliers we use the "Winsorized" versions 

(24) ^j(^j) = min((5^,max(5^,Xj)), 

where 6~ and 6^' are respectively the P and (1 — /?) quantiles of the data 
distribution {xij}^i for each variable xj. The value chosen for /? reflects 
ones, prior suspicions concerning the fraction of such outliers. Depending on 
the nature of the data, small values (/3 ~ 0.025) are generally sufficient. 
With these additions, the predictive model (9) becomes 

K n 

(25) F(x) = ao + E ^krki^) + E ^i^i(^j)' 

A;=l j=i 



with 



N / K 



({afc}^,{fej}i) = arg min ^L\yi,aQ + ^akrk{yii) +^hjlj{ 



(26) 



/ K n 

\k=l j=l 



k=l 
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In order to give each linear term (24) the same a priori influence as a typical 
rule, its normalized version 

lj{xj) ^ 0.4 • lj{xj)/ std{lj{xj)) 

is used in (26), and then the corresponding solution coefficients {bj}i (and 
intercept clq) are transformed to reference the original lj{xj) (24). Here 
std{lj{xj)) is the standard deviation of lj{xj) over the training data and 
0.4 is the average standard deviation (11) of rules with a uniform support 
distribution ~ C/(0, 1). 

Owing to the selective nature of the lasso penalty in (26), many of the 
rule coefficient estimates as well as those bj of the less influential linear 
variables will often have zero values, and thus need not appear in the final 
predictive model (25). 

5.1. Illustration. To illustrate the potential benefit of including the orig- 
inal variables (24) as part of the ensemble, we consider simulated data gen- 
erated from the model 

{5 2 1 ^ 

yi = 10-ne"''"^+E^*^ + ^4 ' 
j=l j=G } 1=1 

with N = 10000 observations and n = 100 input variables, of which 65 have 
no influence on the response y. There is a strong nonlinear dependence 
on the first five input variables and a linear dependence of equal strength 
on 30 others. All input variables were randomly generated from a uniform 
distribution, Xij ~ U{0, 1), and the irreducible noise was generated from a 
Gaussian distribution, ej ~ A^(0,cj^), with a chosen to produce a two-to-one 
signal-to-noise ratio. 

Figure 3 shows the distribution (box plots) of the scaled absolute error 
(20) over 100 data sets randomly generated according to the above pre- 
scription, for three ensembles. The first "linear" involves no rules; only the 
n = 100 linear variables (24) comprise the ensemble. The second ensemble 
"rules" consists of i^' = 2000 rules generated as described in Section 3. The 
third ensemble "both" is the union of the first two; it includes the 100 linear 
variables and the 2000 rules. As seen in Figure 3, the purely linear model 
exhibits relatively poor performance; it has trouble capturing the highly 
nonlinear dependence on the first five input variables (27). The ensemble 
based only on rules provides somewhat improved performance by being bet- 
ter able to approximate the nonlinearity while crudely approximating the 
linear dependence by piecewise constants. The ensemble based on both linear 
variables and rules here provides the highest accuracy. The selection effect 
of the lasso penalty in (26) tends to give high influence to the best rules 
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T 1 

linear rules twin 

Fig. 3. Average absolute error for linear model, rules only model, and combined rules 
and linear base learners. 

for approximating the nonlinear dependencies as well as to the appropriate 
linear terms (24) for capturing the linear component in (27). 

This example was constructed to especially illustrate the potential ad- 
vantage of including linear basis functions as part of rule based ensembles. 
In many applications the corresponding improvement is less dramatic. For 
example, the target functions generated by the random function generator 
used in Section 4 tend to be very highly nonlinear [see Friedman (2001)] 
and the performance of the rule based ensembles including linear functions 
(not shown) was virtually identical to that based on rules alone as shown in 
Figure 2. Also in many applications the numeric variables realize a relatively 
small number of distinct values and the piecewise constant approximations 
based on relatively few rules are at less of a disadvantage at capturing linear 
dependencies. Including linear functions in the basis provides the greatest 
improvement in situations where there are a substantial number of relevant 
numeric variables, each realizing many distinct values, on which the target 
has an approximate linear dependence. However, even in settings unfavor- 
able to linear basis functions, as in Section 4, their inclusion seldom degrades 
performance again owing to the selection effect of the lasso penalty in (26). 
In all the examples presented below the ensemble includes the linear func- 
tions (24) for all of the input variables as part of the basis. 

6. Rule based interpretation. Rules of the form (7) represent easily un- 
derstandable functions of the input variables x, as do the linear functions 
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(24). Although a large number of such functions participate in the initial 
ensemble, the fitting procedure (26) generally sets the vast majority (~ 80% 
to 90%) of the corresponding coefficient estimates ({«fe}f , {^jji ) to zero. 
As noted above, this selection property is a well-known aspect of the lasso 
penalty in (26). The remaining predictors [rules (7) or linear (24)] will have 
varying coefficient values depending on their estimated predictive relevance. 

A commonly used measure of relevance or importance Ik of any predictor 
in a linear model such as (25) is the absolute value of the coefficient of the 
corresponding standardized predictor. For rules, this becomes 

(28) Ik = \ak\- ^/sJ^^^^, 

where Sk is the rule support (12). For the linear predictors (24), the corre- 
sponding quantity is 

(29) Ij = \bj\- std{lj{xj)), 

where std{lj{xj)) is the standard deviation of lj{xj) over the data. Those 
predictors (rules or linear) with the largest values for (28) and (29) are the 
most influential for prediction based on the predictive equation (25). These 
can then be selected and examined for interpretation. 

The importance measures (28) and (29) are global in that they reflect the 
average influence of each predictor over the distribution of all joint input 
variable values. A corresponding local measure of influence at each point x 
in that space is for rules (7) 

(30) 4(x) = lofcl • |rfc(x) - Sfcl, 
and for the linear terms (24) 

(31) Iji^i) = l^il • \h{xi)-h\^ 

where Ij is the mean of lj{xj) over the training data. These quantities mea- 
sure the (absolute) change in the prediction F{x.) when the corresponding 
predictor (rfc(x) or lj{xj)) is removed from the predictive equation (25) and 
the intercept ao is adjusted accordingly. That is, ao <— ao — ^kSk for rules, 
and ao <— ao — bjlj for linear predictors. Note that the average (root-mean- 
square) of (30) and (31) over all x values equates to the corresponding global 
measures (28) and (29). 

For a given coefficient value |afc|, the importance (30) of the corresponding 
rule for a prediction at x depends on its value rfc(x) G {0, 1} at that point, as 
well as its global support (12). A rule is said to "fire" at a point x if rfc(x) = 1. 
From (30) a rule that generally does not fire over the whole space {sk small) 
will have higher importance in regions where it does fire. Conversely, high 
support rules that usually fire will be correspondingly more important at 
points X where they do not fire, rfc(x) = 0. This symmetry is a consequence 
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of the fact that replacing a particular rule (x) by its complement l — r^ (x) 
produces an equivalent fitted linear model, so that either one should be 
assigned the same influence as reflected in (28) and (30). 

The quantities (30) and (31) permit one to evaluate the relative influence 
of the respective predictors (rules or linear) for individual predictions -F(x) 
at X. Those judged most influential can then be examined for interpret- 
ing that particular prediction. These quantities can also be averaged over 
selected subregions S of the input variable space 



where |5| is the cardinality of S. For example, one might be interested in 
those predictors that most heavily influence relatively large predicted values 



where the threshold u might be a high quantile of the predictions {F{'x.i)}i 
over the data set. Similarly, one might define S to be the set of lowest 
predicted values 



with t being a low quantile. In classification, y € {—1,1}, one might be 
interested in those rules that most heavily influence the predictions for each 
of the two respective classes. In this case S = {xj|yj = 1} or = {xj|yj = — 1} 
would be appropriate. 

As with any linear model, the importance measures defined above are in- 
tended to estimate the influence of each individual predictor (rule or linear) 
after accounting for that of the others appearing in the ensemble. To the 
extent that the coefficient estimates are accurate, they will reflect the cor- 
responding influence on the target function (2). These influence measures 
may or may not reflect the usefulness of individual predictors in the absence 
of others. For example, a predictor on which the target function (2) has no 
dependence at all may be useful if it is highly correlated with an important 
predictor, and the latter is removed from the ensemble. The influence mea- 
sures used here are based on the joint contributions of all members of the 
ensemble. 

7. Input variable importance. In predictive learning a descriptive statis- 
tic that is often of interest is the relative importance or relevance of the 
respective input variables {xi,X2, ■ ■ ■ ,Xn) to the predictive model. For the 
models (25) considered here, the most relevant input variables are those 
that preferentially define the most influential predictors (rules or linear) ap- 
pearing in the model. Input variables that frequently appear in important 
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predictors are judged to be more relevant than those that tend to appear 
only in less influential predictors. 

This concept can be captured by a measure of importance Ji(x) of input 
variable xi at each individual prediction point x as 

(35) JKx) = /i(x) + 4(x)/mfc. 

Here //(x) is the importance (31) of the linear predictor (24) involving xi, 
and the second term sums the importances of those rules (7) that contain 
xi (xi € Tfc) each divided by the total number of input variables nik that 
define the rule. In this sense the input variables that define a rule equally 
share its importance, and rules with more variables do not receive exagger- 
ated influence by virtue of appearing in multiple input variable importance 
measures. 

The distribution of { J/(x)}'" (35) can be examined to ascertain the relative 
influence of the respective input variables locally at particular predictions 
X. As with rules, these quantities can be averaged over selected subregions 
of the input variable space using (32), or over the whole space using (28) 
and (29), in place of the corresponding local measures in (35). Illustrations 
are provided in the data examples below. 

8. Interaction effects. A function F{x) is said to exhibit an interaction 
between two of its variables Xj and Xk if the difference in the value of -F(x) as 
a result of changing the value of Xj depends on the value of x^. For numeric 
variables, this can be expressed as 



g^F(x) 

dxj dxk 



2 



>0 



or by an analogous expression for categorical variables involving finite differ- 
ences. If there is no interaction between these variables, the function F(x.) 
can be expressed as the sum of two functions, one that does not depend on 
Xj and the other that is independent of x^'- 

(36) F(x) = /\,(x\,) + Afc(x\fc). 

Here x\j and xy^ respectively represent all variables except Xj and Xk- If a 
given variable xj interacts with none of the other variables, then the function 
can be expressed as 

(37) F(x) = /,(x,) + A,(x\,), 

where the first term on the right is a function only of xj and the second is 
independent of xj. In this case F{x.) is said to be "additive" in Xj. 
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A function -F(x) is said to have an interaction between three (numeric) 
variables xj, and xi if 

^ [ dxj dxk dxi J ' 

again with an analogous expression involving finite differences for categor- 
ical variables. If there is no such three-variable interaction, -F(x) can be 
expressed as a sum of three functions, each independent of one of the three 
variables 

(38) F(X) = /\,(X\,) + /\fc(x\fc) + /y(xy). 

Here x\j , x\^. and xy each respectively represent all of the variables except 
Xj, Xk and x;. Analogous expressions for the absence of even higher order 
interaction effects can be similarly defined. 

Knowing which variables are involved in interactions with other variables, 
the identities of the other variables with which they interact, as well as the 
order and strength of the respective interaction effects can provide useful 
information about the predictive process as represented by the target func- 
tion F*{x.) (2). To the extent that the predictive model F{x) (25), (26) 
accurately represents the target, one can infer these properties by studying 
its interaction effects. 

As noted in Section 3.3, in order for the predictive model -F(x) (25) to 
capture an interaction among a specified subset of its variables, it is nec- 
essary that it contain rules (7) jointly involving all of the variables in the 
subset. This is, however, not a sufficient condition for the presence of such 
an interaction effect in F{x). Different rules jointly involving these variables 
can combine to substantially reduce or possibly even eliminate various in- 
teraction effects between them as refiected in the overall model. Thus, the 
mere presence of rules involving multiple variables does not guarantee the 
existence of substantial interactions between the respective variables that 
define them. In order to uncover actual interaction effects, it is necessary 
to analyze the properties of the full predictive equation, not just individ- 
ual components. Here we use the properties of partial dependence functions 
[Friedman (2001)] to study interaction effects in the predictive model. 

8.1. Partial dependence functions. Given any subset x^j of the predictor 
variables indexed by s C {1, 2, . . . , n}, the partial dependence of a function 
-F(x) on Xg is defined as 

(39) F,(x,) = ^x\jF(x„x\,)], 

where x^ is a prescribed set of joint values for the variables in the subset, 
and the expected value is over the marginal (joint) distribution of all vari- 
ables xy^ not represented in x^. Here x = (xs,x\j,) is the entire variable set. 
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Partial dependence functions were used in Friedman (2001) to graphically 
examine the dependence of predictive models on low cardinality subsets of 
the variables, accounting for the averaged effects of the other variables. They 
can be estimated from data by 

1 ^ 

(40) F,(x,) = -^F(x„x,\,), 

i=l 

where {xjyg}{^ are the data values of xy^. Here we use the properties of cen- 
tered partial dependence functions to uncover and study interaction effects. 
In this section all partial dependence functions as well as the predictive 
model -F(x) (25) are considered to be centered to have a mean value of zero. 

If two variables Xj and do not interact, then from (36) the partial 
dependence of -F(x) on x^ = {xj,Xk) can be decomposed into the sum of the 
respective partial dependences on each variable separately: 

(41) Fjk{xj,Xk) = Fj{xj) + Fk{xk). 

Furthermore, if a given variable Xj does not interact with any other variable, 
then from (37) one has 

(42) F(x) = F,-(x,) + F\,(x\,.)- 

Here F\j{x\j) is the partial dependence of F{x) on all variables except Xj. 

If variables xj, x^ and xi do not participate in a joint three- variable 
interaction, then from (38) the partial dependence of -F(x) on these three 
variables can be expressed in terms of the respective lower order partial 
dependencies as 

Fjki{xj,Xk,xi) =Fjk{xj,Xk) +Fji{xj,xi) +Fki{xk,xi) 

(43) 

-Fj{xj)-Fk{xk)-Fi{xi). 

Analogous relationships can be derived for the absence of higher order in- 
teractions. These properties (41)-(43) of partial dependence functions are 
used to construct statistics to test for interaction effects of various types. 

To test for the presence of an interaction between two specified variables 
{xj,Xk), the statistic 

TV TV 

(44) H^i, = Y^[^jkixij,Xik) -Fj{xij) - Fk{xik)]^ I ^PJ^ixij^Xik) 

i=l i=l 

can be used based on (41) and the empirical estimates (40). It measures the 
fraction of variance of Fjk{xj,Xk) not captured by Fj{xj) + Fkixk) over the 
data distribution. It will have a value of zero if the predictive model F{x) 
(25), (26) exhibits no interaction between xj and Xk and a correspond- 
ingly larger value for a stronger interaction effect between them. Similarly, a 
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statistic for testing whether a specified variable xj interacts with any other 
variable would be from (42) 

N N 

(45) H] = Y.[F{^,) - F,{xi,) - F\,(x,\,)]VE^'(^0- 

This quantity will differ from zero to the extent that Xj interacts with one 
or more other variables. By examining the values of one can identify 

those variables that participate in interaction effects. For each variable Xj so 
identified, the statistics {Hjk]k^j (44) can be used to identify the variables 
that interact with Xj . Note that only those variables that are deemed globally 
relevant via (28), (29) and (35) need be considered for interaction effects. 
This is often a small subset of all n predictor variables. 

If a particular variable Xj is seen to interact with more than one other 
variable using (44), it is of interest to ascertain the order of these interac- 
tions. That is, whether Xj interacts separately with each of them or whether 
subsets of these variables jointly participate in higher order interactions. 
Let Xk and xi be two variables that are identified as interacting with Xj. 
This could represent separate two-variable interactions between {xj,Xk) and 
{xj,xi) only, or the additional presence of a three- variable interaction in- 
volving (xj,Xk,xi). A statistic for testing these alternatives is from (43) 

N 

^jkl ~ ^ jkli-^ij^-^ik-i^il) ~ Fjk{.^ij : ^ik) 
i=l 

(46) - Fji{xij,Xii) - Fki{xik,Xii) + Fj{xij) 

N 

+ Fk{xik) + Fi{xii)f I ^ Fjf,i{xij,Xik, Xii). 

i=l 

This quantity tests for the presence of a joint three-variable interaction be- 
tween Xj, Xk, and xi by measuring the fraction of variance of Fjki{xj,Xk,xi) 
not explained by the lower order interaction effects among these variables. 
Analogous statistics testing for even higher order interactions can be derived, 
if desired. 

By considering the fraction of unexplained variance, the statistics (44) 
and (46) test for the presence of the corresponding interaction effects in 
the predictive model -F(x) but do not necessarily reflect the importance of 
these effects to the overall variation of i^(x). It is possible for an interaction 
effect to be highly significant (see Section 8.3) but not very influential when 
compared to the other effects in the model. If for interpretational purposes 
one would like to uncover these as well as the more influential interactions, 
these statistics (44), (46) are appropriate. If it is desirable to ignore them 
so as to concentrate only on the highly influential interactions, then the 
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statistics can be modified accordingly. Replacing the denominators in (44) 
and (46) with that in (45) will cause the resulting statistics to more closely 
reflect the importance of the corresponding interaction effects to the overall 
model F(x.). 

8.2. Spurious interactions. The strategy outlined in the previous section 
is applied to the predictive model -F(x) (25) (26). As such, it will uncover 
interaction effects present in that model. However, interest is in interaction 
effects present in the target function F* (x) (2) representing the true under- 
lying predictive relationships among the predictor variables x. It is possible 
that even a highly accurate predictive model can contain substantial in- 
teraction effects that are not present in the target F*{x.). These spurious 
interactions can occur when there is a high degree of collinearity among 
some (or all) of the predictor variables in the training data {xj}]^. 

For example, if the target function exhibits a nonlinear additive depen- 
dence (37) on a variable Xj, this dependence on Xj can be equivalently 
approximated by a corresponding additive contribution to the model in- 
volving that variable alone, or by incorporating interaction effects involving 
other variables highly correlated with it. Thus, it is not possible to easily 
distinguish between low and higher order interactions among subsets of vari- 
ables that are highly correlated with each other. If interpretive value is to 
be placed on the presence of various interaction effects, then such spurious 
interactions should not be reported. 

One way to discourage spurious interactions is to restrict their entry into 
the predictive model -F(x) (25), (26). Interactions enter the model through 
rules (7) involving more than one predictor variable. Such rules are derived 
from trees that have splits on different variables at nodes along the path from 
the root node to the nodes that define the respective rules (see Figure 1). 
Thus, one can discourage the entry of unneeded interaction effects by placing 
an incentive for fewer variables along each such path. 

Trees are built in a top-down greedy fashion where the variable chosen 
for splitting at each node is the one that yields the maximal estimated 
improvement to tree predictions as a result of the split. The improvement Zj 
by potentially splitting the node on variable Xj is estimated for all variables, 
and the one 

j* = arg max Zj 

l<j<n 

is chosen for splitting the node in question. Spurious interactions can be 
discouraged by modifying this splitting strategy so as to place an incentive 
for repeated splits on the same variable. Specifically, 

j* = arg max • Z^ 

i<j<n ■' ■' 
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is used to split the node where kj = 1 if the variable Xj does not appear as a 
splitting variable at any ancestor node on the path from the root to the node 
being split, and kj = k (k > 1) if it does. This places a preference on fewer 
variables defining splits along such paths, and thereby defining the rules 
derived from the tree. In particular, once a variable xj is chosen for splitting 
a node, other variables that are highly correlated with it will be discouraged 
from splitting its descendants and thus appearing with it in the same rule. 
Note that this strategy does not necessarily discourage those variables that 
are highly correlated with xj from entering the overall predictive model 
-F(x) (25), (26). They are not discouraged from splitting nodes in the same 
tree that do not contain a split on xj at an ancestor node, and from being 
used for splitting in different trees. This strategy only discourages highly 
correlated variables from defining the same rule (not different rules) and 
thereby suppresses spurious interaction effects in the predictive model caused 
by collinearity. 

The value chosen for the incentive parameter k should be large enough 
to effectively discourage spurious interactions, but not so large as to inhibit 
genuine interactions from entering the predictive model. It should be set to 
the largest value that does not degrade predictive performance as estimated 
by a left out test set or full cross-validation. 

8.3. Null distribution. In order to use the statistics presented in Section 
8.1 for measuring the strength of various kinds of interaction effects, one 
must have an idea of their value in the absence of such effects. Even if a 
particular interaction effect is absent from the target F*(x), the sample 
based estimate of the corresponding statistic will not necessarily be zero. 
Sampling fluctuations can introduce apparent interactions in the estimated 
model F{x). In addition, there are types of associations among the predictor 
variables other than collinearity that if present can also induce spurious 
interactions in the model [Hooker (2004)] for which the strategy discussed 
in Section 8.2 is less effective. 

Here we present a variant of the parametric bootstrap [Efron and Tibshi- 
rani (1993)] that can be used to derive a reference (null) distribution for any 
of the interaction test statistics presented in Section 8.1. The idea is to re- 
peatedly compute these statistics on a series of artificial data sets generated 
from the training data, and then use the distribution of test statistic values 
so derived as a reference for the corresponding test statistic value obtained 
from the original data set. 

For regression, each artificial data set is given by {xj,yj}]^, where 

(47) yi = FAi^i) + {yp(i) - FA{xp(i))). 

Here {p{i)}i represents a random permutation of the integers {1,2,..., A^} 
and -Fa(x) is the closest function to the target containing no interaction 
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effects. For classification y G {—1, 1}, the corresponding response values are 

(48) yi = 2hi-l, 

where bi is a Bernoulli random variable generated with 

(49) Pr(6, = 1) = max(0, min(l, (1 + Fa(x,))/2)), 

and -Fa(x) is derived using (17). The "additive" model -Fa(x) can be esti- 
mated from the original training data set {xj,yj}]^ by restricting the rules 
used in (25) and (26) to each involve only a single predictor variable. This 
is, in turn, accomplished by restricting the trees produced by Algorithm 1 to 
all have = L = 2 terminal nodes. Other techniques for estimating additive 
models could also be used [see Hastie and Tibshirani (1990)]. 

By construction, each data set generated from (47) or (48) has a tar- 
get Fa{x) containing no interaction effects. It has the same predictor vari- 
able joint distribution as the original training data. It also has the same 
(marginal) distribution of the residuals {yi — -F*(xj)}]^ under the null hy- 
pothesis F*(x) = F4(x). 

For each artificial data set {xj,yj}]^ (47), (48), a full predictive model 
F(x) is obtained by applying the identical procedure (modeling parameters, 
etc.) used to obtain the predictive model -F(x) on the original training data 
{xj,yj}]^. The various interaction test statistics of interest obtained from 
F{x) are computed on -F(x). The collection of these computed values over 
all artificially generated data sets can then be used as a reference distribution 
for the corresponding values obtained from -F(x), under the null hypothesis 
of no interaction effects in the target F* (x) . Illustrations are provided in the 
data examples below. 

8.4. Discussion. The general strategy of using partial dependence func- 
tions to detect and analyze interaction effects can be applied to any function 
F(x), not just to those of the form (25) and (26). All that is required to 
compute partial dependence functions (40) is the value of F{x) at various 
prediction points x. Thus, this approach can be used with "black-box" pre- 
diction models derived by any method providing a way to estimate Fa{'x) 
(47) (49). The strategy for discouraging spurious interactions outlined in 
Section 8.2 can only be used with tree based methods however. Inhibiting 
spurious interactions can help to make the strategy more sensitive to the 
presence of genuine interaction effects in the target F*{x.). 

9. Illustrations. In this section we present applications of the interpreta- 
tional tools described in Sections 6-8 to two data sets. The first is artificially 
generated so that results can be compared to known truth. The second is one 
that is often used as a test bed for evaluating prediction methods. Follow- 
ing Friedman and Popescu (2003), the tree ensemble generation parameters 
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used in Algorithm 1 were v = 0.01 and ?] = min(A^/2, 100 + 6\/iV), where N 
is the training sample size. The average tree size (13) was taken to be L = 4 
terminal nodes. Rules were derived from ensembles of 333 trees producing 
approximately 2000 rules used in (26) to produce the predictive model (25). 
These "default" parameter settings are used here for illustration; it is pos- 
sible that individual results could be improved by selective tuning of some 
of them. 



9.1. Artificial data. This data set consists of = 5000 observations with 
n = 100 predictor variables. To somewhat realistically emulate actual data, 
the predictor variables were generated with ten discrete values Xij S {/c/10}o, 
with the integers k randomly generated from a uniform distribution. Each 
response value was taken to be 

(50) ?/, = F*(x,)+e„ 

where the target function is 

3 

F*(x) = 9 n exp(-3(l - Xjf) - 0.8exp(-2(x4 - x^)) 

(51) 

+ 2 sin (vr • xq) — 2.5(x7 — xg) 

and Ei ~ A(0,cr^) with the value of a was chosen to produce a two-to-one 
signal-to-noise ratio. Note that this target depends on only eight of the pre- 
dictor variables; the other 92 are pure noise variables having no effect on the 
response. The coefficients multiplying each of the terms in (51) were chosen 
so as to give each of the first eight variables approximately equal global in- 
fluence (28), (29), (35). The target function is seen from (51) to involve a 
strong three- variable interaction effect among (xi, X2, X3), a somewhat differ- 
ent two- variable interaction between X4 and X5, a highly nonlinear additive 
dependence on xq, and linear dependencies of opposite sign on xj and xq. 

Applying RuleFit to these data produced a model (25) involving 351 terms 
(rules + linear) with nonzero coefficients. The average absolute error 

(52) aae = — --- 

Exyly - median(?/)| 

was aae = 0.49 as estimated with 50000 independently generated test ob- 
servations. The corresponding error for a model involving main effects only 
(L = 2) (13) was 0.61. Using only hnear basis functions (24) in (25) and (26) 
produced aae = 0.69. Thus, including additive nonlinear terms in the model 
improves prediction accuracy by ~ 12% over a purely linear model, and al- 
lowing interaction effects produces another ~ 20% improvement. However, 
these prediction errors include the irreducible error caused by the additive 
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Table 1 

Simulated example: six most important rules — all predictions 



Imp. 


Coeff. 


sup. 


Rule 


100 


0.57 


0.49 


0.25 < < 0.75 


99 


0.79 


0.15 


Xi > 0.35 and X2 > 0.45 and 13 > 0.45 


83 


-0.81 




linear: X7 


63 


0.61 




linear: a::g 


61 


0.34 


0.51 


0.35 < a;6 < 0.85 


58 


-0.38 


0.25 


X4 < 0.35 and xs > 0.45 



random component £{ in (50). The corresponding errors (20) in estimating 
the actual target function F*{x.) itself are respectively 0.18, 0.43, and 0.58. 
Thus, including interactions improved estimation accuracy by 58% over a 
purely additive model. Of course, with actual rather than artificially gen- 
erated data, one can only estimate (52) and estimation inaccuracy on the 
target (20), while decreasing monotonically with (52), is unknown. 

9.1.1. Rule importance. Table 1 displays the six globally most impor- 
tant terms (28), (29) resulting from the RuleFit model (25), (26) in order 
of their estimated importance. Column 1 gives the respective importances 
scaled so that the highest estimate receives a value of 100. Column 2 shows 
the corresponding coefficients {a,b). For rules (7) the coefficient (a) value 
represents the change in predicted value if the rule is satisfied ( "fires" ) . For 
linear terms (24) the coefficient is its corresponding slope parameter b. The 
third column gives the support (12), where appropriate, for the respective 
rules displayed in column 4. 

Comparing Table 1 with the (here known) target function (51), one sees 
that these six most important terms (out of 351 total) provide a reasonable 
qualitative description of its dependence on the 100 predictor variables. None 
of these terms include any of the noise variables {xi}}p^ . The first and fifth 
rules indicate larger target function values when xq is in the middle of its 
range of values. The second rule produces larger target values when xi, X2 
and X3 simultaneously realize high values. The third and fourth terms reflect 
the linear dependences on xj and xg. The sixth rule indicates smaller target 
values when X4 is small and X5 is large. 

9.1.2. Input variable importance. The upper left frame of Figure 4 shows 
the relative importance of the ten most important input predictor variables 
(35), as averaged over all predictions (28) and (29), in descending order 
of estimated importance. By construction, the target (51) depends on each 
of the first eight variables xi-xs with roughly equal (global) strength and 
has no dependence on xq-xiqq. Even though the standard deviation of the 
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Fig. 4. Input variable relative importances for the simulated data as averaged over all 
(upper left), the 10% lowest (lower left) and 10% highest (lower right) predictions, and for 
the single prediction point {xj =0.5}" (upper right). 



irreducible error e is here one half of that of the target function, one sees 
that none of the 92 noise variables has estimated relative importance greater 
than 5% of that for the eight relevant variables. 

The upper right frame in Figure 4 shows the relative importance of the 
first eight predictor variables plus the two most relevant noise variables 
for a single prediction point {xj = O.S}}'^^ (30), (31), (35). Here one sees 
varying importance for each of the relevant predictor variables with the 
(additive) variables {x^^xj^x^} being somewhat more influential. The lower 
left and right frames respectively show the corresponding relative variable 
importances for the 10% lowest (32), (34), (35) and 10% highest (32), (33), 
(35) predicted target values. Here one sees that variables xi, X2 and X3 
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Fig. 5. Total interaction strength in excess of expected null value of the first ten input 
variables for the simulated data. The lower dark bars represent the null standard deviations. 



dominately influence the highest predicted values, whereas X4^-xs are most 
influential for the lowest predictions. 

9.1.3. Interaction effects. Figure 5 displays the strengths of the interac- 
tion effects involving each of the first ten predictor variables. The height of 
each bar represents the corresponding value of 

(53) H,=H,-Hf\ 

where Hj is given by (45) for each respective variable Xj based on the orig- 
inal data, and H^^^ is the mean (null) value of the same statistic averaged 
over ten runs of the parametric bootstrap as described in Section 8.3. Thus, 
each bar reflects the value of Hj in excess of its expected value under the null 
hypothesis of no interaction effects. The dark bars shown in Figure 5 are the 
values of the standard deviations of the respective null distributions, so 
that one can visually gauge the significance of each corresponding interac- 
tion. The dark bars are plotted over the lighter ones so that the absence of 
a light bar indicates that the corresponding value of Hj is less than or equal 

to one standard deviation above its null mean value H'j^\ 

The results shown in Figure 5 suggest that variables xi, X2 and X3 are 
each heavily involved in interactions with other variables. Variables X4 and 
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Fig. 6. Two-variable interaction strengths of variables interacting with xi (upper) and 
three-variable interaction strengths of variables interacting with xi andx2 (lower) in excess 
of their expected null value for the simulated data. The lower dark bars represent the 
corresponding null standard deviations. 



also substantially interact with other variables, but to a somewhat lesser 
extent. There is no evidence of any interaction effects involving variables 
xe-xio- 

After identifying those variables that interact with others, it is of interest 
to determine the particular other variables with which each one interacts. 
The upper frame of Figure 6 displays the values of {Hik}2^, where 

(54) H,u = H,,-Hfl 

Here Hj^ is given by (44) for the respective variables {xj,Xk) and Hj^ is 
the corresponding expected null value averaged over ten replications of the 
parametric bootstrap (Section 8.3). The dark bars plotted over the light 

ones reflect the corresponding null standard deviations a^j} . 

Here one sees that xi is dominately interacting with X2 and x^ and there 
is no strong evidence of xi interacting with variables other than X2 and X3. 

Since xi is seen to interact with more than one other variable, one can 
proceed to determine the orders of the corresponding interactions. The lower 
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Fig. 7. Two-variable interaction strengths of variables interacting with X4 (upper) and 
X5 (lower) in excess of expected null value for the simulated data. The lower dark bars 
represent the null standard deviations. 



frame of Figure 6 shows {Hi2i}3^ with 

(55) H,u = H^ki-Hf^i 

being the nuh mean adjusted analog of (46), along with {irlg/ls^ (dark bars). 
This plot reveals that x\ and X2 jointly interact with X3, but with no other 
variables, implying a three- variable interaction among these three variables 
but no other three- variable interactions involving xi and X2. 

The upper frame of Figure 7 shows {-ff4fc}fc^4 (54) along with the corre- 
sponding a^^^ (dark) for the first ten predictor variables. Here one sees that 
X4 tends to only interact with x^. The lower frame shows the corresponding 
interaction plot for X5, which is seen to only interact with x^. Thus, X4 and 

interact only with each other and there is no evidence that they interact 
with any other variables. 

The conclusion to be drawn from this analysis of interactions is that 
these data provide strong evidence for a three-variable interaction effect 
between xi, X2 and 3:3, and a two- variable interaction between X4 and x^. 
There is no evidence for any other interaction effects. Note that the noise 
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variables xg and xiq that were judged from (35) to be irrelevant are seen to 
be inconsequential in the analysis of interaction effects and thus need not 
have been considered. 

The particular target function (51) generating these data was chosen to 
illustrate the properties the test statistics used to uncover various types of 
interactions. As such, it involved strong interaction effects among some of 
the variables and none at all among others. Target functions occurring in 
practice seldom have such sharp distinctions. Often the various predictor 
variables tend to be involved in a wide variety of interaction effects of vary- 
ing types and strength, and the goal is to uncover those (if any) that are 
sufficiently important. 

9.1.4. Partial dependencies. Figure 8 displays partial dependence (40) 
plots on selected variables as suggested by the analysis of interactions above. 
For display purposes, all partial dependence functions are translated to have 
a minimum value of zero. The partial dependencies on (xijXa) and (x2,X3) 
are very similar to that shown for (xi,X2) in the upper left frame, and that 
for xg is very similar to that shown in the lower right frame for xj but 
with opposite slope. Comparing these with the actual target function (51), 
one sees that they provide a fairly representative pictorial description of the 
dependence of the response on the predictor variables. 

9.2. Boston housing data. This is a well-known public data set often used 
to compare the performance of prediction methods. It consists of = 506 
neighborhoods in the Boston metropolitan area. For each neighborhood, 
14 summary statistics were collected [Harrison and Rubinfield (1978)]. The 
goal is to predict the median house value (response) in the respective neigh- 
borhoods as a function of the n = 13 other (predictor) variables. Here we 
investigate the nature of the dependence of the response (measured in units 
of $1000) on these predictors using the tools described in Sections 6-8. 

Applying RuleFit to these data produced a model (25) involving 215 terms 
(rules + linear) with nonzero coefficients. The average absolute prediction 
error (52) was aae = 0.33, as estimated, by 50-fold cross-validation. The 
corresponding error for an additive model restricted to main effects only 
(L = 2) (13) was 0.37, and that for a model involving only linear terms (24) 
was aae = 0.49. Thus, the target function appears to be highly nonlinear 
with some evidence for interaction effects. 

9.2.1. Rule importance. Table 2 shows the nine globally most important 
terms (28), (29) resulting from the RuleFit model (25), (26), in the same 
format as Table 1. The most important term by a substantial margin is the 
linear function of LSTAT (percent of lower status population). Its coefficient 
b is negative, indicating that neighborhoods with larger values of LSTAT 
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Fig. 8. Plots of partial dependence functions on selected single variables and variable 
pairs for the simulated example. 

tend to have lower valued homes. The linear predictor AGE (fraction of 
houses build before 1940) has a similar effect to a lesser degree. 

The coefficient a of the most important rule is roughly five times larger in 
absolute value than that of the others and indicates neighborhoods with ex- 
ceptionally high housing values. These neighborhoods are characterized by 
being very close to Boston employment centers (DIS), high pupil-teacher ra- 
tio {PTRATIO), and very small LSTAT. This rule describes only five of the 
506 neighborhoods: two of the six neighborhoods in Back Bay, and all three 
in Beacon Hill. The other rules in Table 2 indicate that neighborhoods with 
larger houses (number of rooms RM) and lower pollution (concentration 
of nitric oxide NOX), as well as larger houses and lower PTRATIO, tend 
to have higher valued homes. Neighborhoods not very close to employment 
centers, combined with smaller houses and higher tax rates (TAX), as well 
as combined with high PTRATIO, tend to have lower valued homes. 
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9.2.2. Input variable importance. The upper left frame of Figure 9 shows 
the global relative importances of the 13 predictor variables (28), (29), (35) 
averaged over all neighborhoods. In addition to those variables presented 
in Table 2, there is some indication that crime rate (CRIM) has some in- 
fluence on housing values. The upper right frame shows the corresponding 
importances for predicting median home value in the single neighborhood 
comprising the town of Manchester (30), (31), (35). Here RM and TAX 
are relatively more influential for this prediction than on average, whereas 
LSTAT is considerably less influential. The lower left and right frames re- 
spectively show the corresponding relative variable importances for those 
neighborhoods with the 10% lowest (32), (34), (35) and 10% highest (32), 
(33), (35) predicted housing values. For the lowest predictions, the variable 
LSTAT dominates, being more than twice as important than any other vari- 
able. For the highest predicted values, RM is the most important variable 
and PTRATIO is nearly as important as LSTAT. Pollution NOX seems to 
be roughly equally relevant everywhere. 

9.2.3. Interaction effects. Figure 10 shows the values of {Hj}f (53), 
along with the corresponding null standard deviations, for the Boston hous- 
ing predictor variables. There is strong evidence for interactions involving 
NOX, RM, DIS, PTRATIO and LSTAT. Here we investigate further the 
nature of those involving RM and LSTAT . 

The upper frame of Figure 11 displays the values of {Hrm ,k}k=iRM (54) 
along with the corresponding null standard deviations. One sees strong ev- 
idence for an interaction effect between RM and NOX and between RM 
and PTRATIO. The lower frame shows the corresponding plot for LSTAT 
indicating substantial interaction effects involving LSTAT and NOX, and 
LSTAT and DIS. Since RM and LSTAT are each seen to interact with 
more than one other variable, one can use (55) to investigate the presence 



Table 2 

Boston housing data: nine most important rules 



Imp. 


Coeff. 


Sup. 


Rule 


100 


-0.40 




linear: LSTAT 


37 


-0.036 




linear: AGE 


36 


10.1 


0.0099 


DIS < 1.40 and PTRATIO > 17.9 and LSTAT < 10.5 


35 


2.26 


0.23 


RM > 6.62 and NOX < 0.67 


26 


-2.27 


0.88 


RM < 7.45 and DIS > 1.37 and TAX > 219.0 


25 


-1.40 


0.41 


DIS > 1.30 and PTRATIO > 19.4 


20 


2.58 


0.049 


RM > 7.44 and PTRATIO < 17.9 


19 


1.30 


0.21 


RM > 6.64 and NOX < 0.67 


18 


2.15 


0.057 


RM > 7.45 and PTRATIO < 19.7 
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Fig. 9. Input variable relative importances for the Boston housing data as averaged over 
all (upper left), the 10% lowest (lower left), and 10% highest (lower right) predictions, and 
for predicting the single neighborhood of Manchester (upper right). 



of three-variable interactions. In this case, however, the analysis revealed 
no evidence for any three- variable interactions involving RM or LSTAT . 
This strategy can be continued to potentially uncover additional interaction 
effects if any. 

9.2.4. Partial dependencies. Figure 12 displays partial dependence func- 
tions (40) on the four variable pairs indicated above as participating in 
two-variable interactions. From these plots one can study the detailed na- 
ture of the corresponding interaction effects. For example, the lower right 
plot indicates that housing values sharply increase when LSAT and DIS 
simultaneously have very small values. 



PREDICTIVE LEARNING VIA RULE ENSEMBLES 



35 



10. Related work. Predictive methods based on rules have a long his- 
tory in the machine learning literature [see Mitchell (1997)]. Quinlan (1993) 
designed a variant of C4.5 ("C4.5 Rules") where the final model consists of 
a set of rules. A single large decision tree is induced and then converted to 
a set of rules, one for each terminal node. Each such rule is subsequently 
pruned by removing the conditions (indicator functions) that improve its 
estimated prediction accuracy. Finally, the pruned rules {rm(x)} are each 
assigned a class label and then listed in ascending order of their estimated 
accuracy. To obtain a prediction at a point x, the single rule highest in this 
list for which rm(x) = 1 is used. Although there are fundamental differences, 
this approach is connected to the work presented here in that a decision tree 
induction algorithm is employed as a greedy mechanism for generating the 
rules. 

A different rule induction paradigm used in classification context is se- 
quential covering, that underlies the Inductive Logic Programming (ILP) 
algorithms [Lavrac and Dzeroski (1994)]. The generic sequential covering 
algorithm induces a disjunctive set of rules by learning one rule at a time. 
After each rule is derived, the algorithm removes from the training data set 
the "positive" examples (specified y-value) covered by the rule. The pro- 
cess is then iterated on the remaining training observations. As with C4.5 
Rules, the generated rule set is ordered and the single rule highest in the 
list that covers a point x is used for its prediction. Actual ILP algorithms 
such as CN2 [Clark and Niblett (1989)], RIPPER [Cohen (1995)], and PRO- 
GOL [Muggleton (1995)] differ with respect to the detailed techniques that 
implement the generic paradigm. 



Boston housing - iriteractlons 





Fig. 10. Total interaction strength in excess of expected null value of the input variables 
for the Boston housing data. The lower dark bars represent the null standard deviations. 
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Fig. 11. Two-variable interaction strengths of variables interacting with RM (upper) and 
LSTAT (lower) in excess of expected null value for the Boston housing data. The lower 
dark bars represent the null standard deviations. 

Although rule based, RuleFit produces fundamentally different models 
than the methods described above, both with respect to the methodology 
employed to derive the final model and the structure of this model. RuleFit 
models (25) are additive in rules (7) and linear terms (24) with optimized 
weights (coefficients), whereas the above methods produce disjunctive sets 
of rules using only one in the set for each prediction. 

Classification ensembles that combine simple "weak" learners that are 
unions of conjunctive rules can be found in algorithmic implementations 
of the stochastic discrimination paradigm [Kleinberg (1996)]. Each weak 
learner is produced by a random mechanism (e.g., a finite union of rectan- 
gular boxes where each box is generated using a random set of variables, ran- 
dom centering, and random length edges) . The corresponding weak learners 
chosen for the final model are required to satisfy certain "enrichment" and 
"uniformity" conditions. Details are presented in Ho and Kleinberg (1996) 
and Kleinberg (2000). [See also Pfahringer et al. (2004)]. As with RuleFit, 
stochastic discrimination combines its base (weak) learners in an additive 
manner. The major differences are the mechanism employed to generate the 
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Fig. 12. Plots of partial dependence functions on selected variable pairs for the Boston 
housing data. 



additive terms and the fact that stochastic discrimination performs a sim- 
ple averaging, whereas the coefficients of RuleFit models are fit through a 
regularized regression (26). 

SLIPPER [Cohen and Singer (1999)] uses the AdaBoost strategy to pro- 
duce a weighted ensemble of boosted RIPPER rules. While generally outper- 
forming standard rule induction methods, this approach tends not to match 
the performance of boosted tree ensembles [Weiss and Indurkhya (2000)]. 
Light weight rule induction [LRI, Weiss and Indurkhya (2000)] uses a sim- 
ple heuristic strategy based on the boosting concept to produce unweighted 
rule ensembles having an equal number of rules for each class. They pro- 
vide evidence that this approach tends to outperform SLIPPER and single 
trees for small rule sets, and with larger ensembles was competitive with the 
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best methods available at the time. Both SLIPPER and LRI sequentially 
induce relatively small rule sets, all of which are used for prediction. Rule- 
Fit initially induces a large number of rules and then employs regularized 
regression (10), (26) to produce and weight the smaller set comprising the 
final predictive model. 

Designed for problems involving binary valued features, logic regression 
[Ruczinski, Kooperberg and LeBlanc (2003)] uses regression to fit the coef- 
ficients of a model that is additive in rules "logic trees." A set of admissible 
operations is defined for modifying the logic trees and the model building 
process involves randomly applying these operations in a simulated anneal- 
ing context. Due to the intensive nature of the computation involved with 
the simulated annealing approach, logic regression can accommodate mod- 
els involving relatively few logic terms. Also, generalizations to numeric and 
multiple-valued categorical variables complicate this approach. 

Closer to the approach presented here is that of Rosset and Inger (2000). 
They constructed binary classification models using (unregularized) linear 
logistic regression where the predictors were taken to be the original input 
variables along with manually selected and modified C4.5 rules based on 
those variables. 

Ruckert and Kramer (2006) propose the use of regularized regression to 
construct weighted rule ensembles for classification. An initial set of rules 
is defined without reference to the outcome variable y. At each step one 
additional rule from this set is introduced into the model in a (user) pre- 
defined order. A regularized regression is performed on the rules currently 
in the model, and a upper bound on the true (population) value of the 
fitting criterion is computed based on its empirical value and the current 
number of rules. These steps are repeated until all of the initial rules have 
been included. The model in this sequence achieving the lowest upper bound 
on the criterion is then used for prediction. Three fitting criteria were pro- 
posed with the margin minus variance (MMV) , with an li constraint on the 
weights being preferred among the three. Principal differences between this 
approach and RuleFit are that the latter uses information in the outcome 
variable y to preferentially generate a good initial rule set (Algorithm 1), 
and the order of rule entry is determined by the data directly through the 
regularized regression procedure (10), (26). 

For interpretation, Breiman et al. (1983) proposed a predictor variable 
importance measure for (single) trees. The relative importance for each vari- 
able was taken to be the sum of the improvements in squared-error risk on 
the training data at each nonterminal node split on that variable. Friedman 
(2001) and Breiman (2001) extended this measure to tree ensembles by sim- 
ply averaging it over the trees appearing in the ensemble ("Gini" measure). 
Breiman (2001) also suggested a permutation based variable importance 
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measure. The relevance of each variable was taken to be the increase in pre- 
diction risk of the model, as averaged over the training data, when the values 
of that variable were randomly permuted among the observations. Like those 
described in Sections 6 and 7, these measures reflect the marginal influence 
of each respective variable in the presence of the other variables. They need 
not reflect usefulness in the absence of other variables. Also, the permuta- 
tion measure is essentially global in nature and is not readily extended to 
produce corresponding local measures at individual predictions (30), (31), 
(35). However, the Gini measure could be so extended. 

Roosen (1995), Owen (2001) and Jiang and Owen (2001) study interaction 
effects in "black-box" models using the functional ANOVA decomposition 
of -F(x) and product measure. Hooker (2004) discusses the limitations of 
using product measure in the context of observational data and proposes 
alternatives that are intended to mitigate this constraint. Our approach 
to interactions based on partial dependence functions (Section 8.1) does 
not involve the functional ANOVA decomposition. Hooker (2004) observes 
that associations among the predictor variables can sometimes introduce 
distortion in partial dependence estimates based on empirical models. This 
motivates our approach of suppressing spurious interactions presented in 
Section 8.2, and using null distributions as derived in Section 8.3 to calibrate 
observed interaction effects. 
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